Towards a Universal Web Wrapper
Abstract
The wealth of information contained in the World Wide Web has created much interest in systems for integrating information from multiple sites. We describe a universal wrapper machine that can learn to extract information from the web given only a set of general rules describing the data domain. It cleanly separates site-independent and site-specific knowledge from the execution implementation. Site-independent knowledge is expressed in user-supplied domain rules, while site-specific knowledge is expressed in automatically generated context-free grammars that describe site structures. The two are combined by using the domain rules to semantically interpret the parse trees generated by the grammars. The resulting declarative wrapper specifications are easily understood by humans and can be executed to perform information extraction. Once extracted, tuples can be queried by external agents using a high-level agent communication language.
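As a rough illustration of how domain rules might be used to interpret a parse tree, the sketch below labels the leaves of a toy parse with site-independent rules. It is not the authors' implementation: the grammar here is a hand-written stand-in (the paper generates context-free grammars automatically), and `DOMAIN_RULES`, `parse_page`, `interpret`, and the sample page are all hypothetical names and data chosen for the example.

```python
import re

# Hypothetical site-independent domain rules: attribute name -> pattern.
DOMAIN_RULES = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "year": re.compile(r"(?:19|20)\d{2}"),
}


def parse_page(html):
    """Toy stand-in for a site-specific grammar.

    page   -> record*
    record -> "<li>" field ("|" field)* "</li>"
    Returns a two-level parse tree: a list of records, each a list of leaves.
    """
    records = re.findall(r"<li>(.*?)</li>", html, flags=re.S)
    return [[field.strip() for field in rec.split("|")] for rec in records]


def interpret(tree):
    """Label each leaf with the first domain rule that fully matches it.

    Leaves that match no rule are assumed to be titles (an assumption of
    this sketch, not something stated in the abstract).
    """
    tuples = []
    for record in tree:
        row = {}
        for leaf in record:
            label = next(
                (name for name, pat in DOMAIN_RULES.items() if pat.fullmatch(leaf)),
                "title",
            )
            row[label] = leaf
        tuples.append(row)
    return tuples


# Field order differs between records; the domain rules still recover it.
PAGE = "<ul><li>Dune | 1965 | $9.99</li><li>$12.50 | Neuromancer | 1984</li></ul>"
TUPLES = interpret(parse_page(PAGE))
print(TUPLES)
```

The point of the split is that `DOMAIN_RULES` would stay the same across sites, while only `parse_page` would change from one site layout to another.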
Similar resources
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibil...
An Integrated Architecture for Exploring, Wrapping, Mediating and Restructuring Information from the Web
The goal of information extraction from the Web is to provide an integrated view on heterogeneous information sources. A main problem with current wrapper/mediator approaches is that they rely on very different formalisms and tools for wrappers and mediators, thus leading to an “impedance mismatch” between the wrapper and mediator level. Additionally, most approaches currently are tailored to a...
Automatic Wrapper Generation and Maintenance
This paper investigates automatic wrapper generation and maintenance for forum, blog and news web sites. Web pages are increasingly generated dynamically using a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate the wrapper from this kind of web page. The tree alignment algorithm is adopted...
The Camaleon Web Wrapper Engine
The web is rapidly becoming the universal repository of information. A major challenge is the ability to support the effective flow of information among the sources and services on the web and their interconnection with legacy systems that were designed to operate with traditional relational databases. This paper describes a technology and infrastructure to address these needs, based on the des...